Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ANR due to the tokio runtime being blocked by getaddrinfo when dropped. #7210

Merged
merged 7 commits into from
Nov 22, 2024

Conversation

dlon
Copy link
Member

@dlon dlon commented Nov 20, 2024

Use hickory-dns to resolve {ipv4,ipv6}.am.i.mullvad.net, which is non-blocking, instead of GAI.

Fix DES-1409.


This change is Reviewable

@dlon dlon requested a review from Rawa November 20, 2024 12:22
Copy link

linear bot commented Nov 20, 2024

@albin-mullvad albin-mullvad added the Android Issues related to Android label Nov 20, 2024
Copy link
Contributor

@Rawa Rawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 23 of 23 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Rawa
Rawa previously approved these changes Nov 21, 2024
Copy link
Contributor

@Rawa Rawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Copy link
Collaborator

@albin-mullvad albin-mullvad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 23 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Pururun
Pururun previously approved these changes Nov 21, 2024
Copy link
Contributor

@Pururun Pururun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 23 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dlon dlon dismissed stale reviews from Pururun and Rawa via 9df62bc November 21, 2024 10:26
Copy link
Contributor

@hulthe hulthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 17 of 23 files at r1.
Reviewable status: 18 of 24 files reviewed, 7 unresolved discussions (waiting on @albin-mullvad, @dlon, @Pururun, and @Rawa)


talpid-core/src/connectivity_listener.rs line 109 at r1 (raw file):

        sender: UnboundedSender<Connectivity>,
    ) -> Result<(), Error> {
        let sender_ptr = Box::into_raw(Box::new(sender)) as jlong;

The lifetime of this pointer should be clearly documented


talpid-core/src/connectivity_listener.rs line 204 at r1 (raw file):

            .map_err(|cause| Error::FindMethod("ConnectivityListener", method, cause))?;

        env.call_method_unchecked(self.object.as_obj(), method_id, return_type, parameters)

How come we're using call_method_unchecked instead of the safe version?


talpid-core/src/connectivity_listener.rs line 220 at r1 (raw file):

    let connected = JNI_TRUE == connected;

    let sender = unsafe { Box::from_raw(sender_address as *mut UnboundedSender<Connectivity>) };

Should have // SAFETY-reasoning.

Might also be worth to enable the undocumented_unsafe_blocks clippy lint for this module


mullvad-api/src/lib.rs line 326 at r1 (raw file):

        Box::pin(async move {
            let addrs = tokio::task::spawn_blocking(move || (host, 0).to_socket_addrs())

Maybe comment on why we do this thread dance.


mullvad-daemon/src/lib.rs line 617 at r1 (raw file):

            Ok(listener) => listener,
            Err(error) => {
                log::warn!(

Nit: This seems like a fatal error, log::error is more appropriate, no?


mullvad-daemon/src/android_dns.rs line 1 at r1 (raw file):

#![cfg(target_os = "android")]

btw why is this defined in Daemon and not mullvad_api where the trait is? Because it depends on talpid-core?


mullvad-daemon/src/android_dns.rs line 4 at r1 (raw file):

//! A non-blocking DNS resolver. `getaddrinfo` tends to prevent the tokio runtime from being
//! dropped, since it waits indefinitely on blocking threads. This is particularly bad on Android,
//! so we use a non-blocking resolver instead.

Minor: When there's a module that exports a single type, I personally prefer putting the docs on the type itself, not the module. Either way though, I think the module docs or the struct docs should link to each other where the real docs are, i.e:

/// See [module docs](self)
pub struct AndroidDnsResolver {}

// or

mod foo {
    //! See [Thingy].

    pub struct Thingy;
}

Copy link
Contributor

@MarkusPettersson98 MarkusPettersson98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 14 of 23 files at r1, 5 of 6 files at r2.
Reviewable status: 23 of 24 files reviewed, 12 unresolved discussions (waiting on @dlon and @hulthe)


talpid-core/src/connectivity_listener.rs line 220 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

Should have // SAFETY-reasoning.

Might also be worth to enable the undocumented_unsafe_blocks clippy lint for this module

+1


mullvad-daemon/src/lib.rs line 623 at r2 (raw file):

                return Err(Error::DaemonUnavailable);
            }
        };

⛏️ std::result::Result::inspect_err + std::result::Result::map_err

#[cfg(target_os = "android")]
let connectivity_listener = match ConnectivityListener::new(android_context.clone())
    .inspect_err(|err| {
        log::warn!(
            "{}",
             error.display_chain_with_msg("Failed to start connectivity listener")
        );
    })
    .map_err(|_| Error::DaemonUnavailable)?;

Code quote:

        #[cfg(target_os = "android")]
        let connectivity_listener = match ConnectivityListener::new(android_context.clone()) {
            Ok(listener) => listener,
            Err(error) => {
                log::warn!(
                    "{}",
                    error.display_chain_with_msg("Failed to start connectivity listener")
                );
                return Err(Error::DaemonUnavailable);
            }
        };

mullvad-daemon/src/android_dns.rs line 16 at r2 (raw file):

pub struct AndroidDnsResolver {
    connectivity_listener: talpid_core::connectivity_listener::ConnectivityListener,
}

⛏️ Import talpid_core::connectivity_listener::ConnectivityListener, do not qualify the ConnectivityListener type

Code quote:

pub struct AndroidDnsResolver {
    connectivity_listener: talpid_core::connectivity_listener::ConnectivityListener,
}

mullvad-daemon/src/android_dns.rs line 38 at r2 (raw file):

                format!("Failed to retrieve current servers: {err}"),
            )
        })?;

Tip: Use std::io::Error::other

Code quote:

        let ips = ips.map_err(|err| {
            io::Error::new(
                io::ErrorKind::Other,
                format!("Failed to retrieve current servers: {err}"),
            )
        })?;

mullvad-daemon/src/android_dns.rs line 46 at r2 (raw file):

        let lookup = resolver.lookup_ip(host).await.map_err(|err| {
            io::Error::new(io::ErrorKind::Other, format!("lookup_ip failed: {err}"))
        })?;

Tip: Use std::io::Error::other

Code quote:

        let lookup = resolver.lookup_ip(host).await.map_err(|err| {
            io::Error::new(io::ErrorKind::Other, format!("lookup_ip failed: {err}"))
        })?;

mullvad-jni/src/lib.rs line 126 at r2 (raw file):

        _ = context.runtime.block_on(context.running_daemon);

        // Dropping the tokio runtime will block if there are any tasks in flight.

Could we specify what "in flight" means in this context? I suppose tasks refer to tokio tasks, and if so, the tokio runtime will fail to drop if any currently running task never yields at an await point (i.e. if it performs blocking I/O). WDYT?

Code quote:

       // Dropping the tokio runtime will block if there are any tasks in flight.

mullvad-api/src/https_client_with_sni.rs line 422 at r1 (raw file):

            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "Empty DNS response"))?;
        Ok(SocketAddr::new(*addr, port.unwrap_or(DEFAULT_PORT)))
    }

This many early returns warrants a doc-comment imo. If someone changes this functions, what invariants do they need to uphold? I.e. how is the sought address resolved.

I know that it coould be considered out of scope for this PR, but it is also a good oppurtunity to apply the boy scout rule 😊

Code quote:

    async fn resolve_address(
        address_cache: AddressCache,
        dns_resolver: &dyn DnsResolver,
        uri: Uri,
    ) -> io::Result<SocketAddr> {
        const DEFAULT_PORT: u16 = 443;

        let hostname = uri.host().ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "invalid url, missing host")
        })?;
        let port = uri.port_u16();
        if let Ok(addr) = hostname.parse::<IpAddr>() {
            return Ok(SocketAddr::new(addr, port.unwrap_or(DEFAULT_PORT)));
        }

        // Preferentially, use cached address.
        //
        if let Some(addr) = address_cache.resolve_hostname(hostname).await {
            return Ok(SocketAddr::new(
                addr.ip(),
                port.unwrap_or_else(|| addr.port()),
            ));
        }

        // Use DNS resolution as fallback
        //
        let addrs = dns_resolver.resolve(hostname.to_owned()).await?;
        let addr = addrs
            .get(0)
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "Empty DNS response"))?;
        Ok(SocketAddr::new(*addr, port.unwrap_or(DEFAULT_PORT)))
    }

mullvad-problem-report/src/lib.rs line 303 at r2 (raw file):

    )
    .await
    .map_err(Error::CreateRpcClientError)?;

Code smell? 😉

Code quote:

    let api_runtime = mullvad_api::Runtime::with_cache(
        // This is irrelevant since no DNS lookups will be made
        DefaultDnsResolver,
        cache_dir,
        false,
        #[cfg(target_os = "android")]
        None,
    )
    .await
    .map_err(Error::CreateRpcClientError)?;

mullvad-api/src/lib.rs line 364 at r1 (raw file):

        handle: tokio::runtime::Handle,
        dns_resolver: impl DnsResolver,
    ) -> Result<Self, Error> {

Strictly speaking, do we always need to supply a DnsResolver when creating a new api runtime? Couldn't we use DefaultDnsResolver by default and expose the option to configure the DnsResolver for an existing runtime on-the-fly?

Code quote:

    pub fn new(
        handle: tokio::runtime::Handle,
        dns_resolver: impl DnsResolver,
    ) -> Result<Self, Error> {

mullvad-api/src/lib.rs line 325 at r2 (raw file):

        Ok(addrs.map(|addr| addr.ip()).collect())
    }
}

Document what mechanism that is used for DNS lookups in this implementation (https://doc.rust-lang.org/std/net/trait.ToSocketAddrs.html#tymethod.to_socket_addrs)

Code quote:

pub struct DefaultDnsResolver;

#[async_trait]
impl DnsResolver for DefaultDnsResolver {
    async fn resolve(&self, host: String) -> io::Result<Vec<IpAddr>> {
        use std::net::ToSocketAddrs;
        let addrs = tokio::task::spawn_blocking(move || (host, 0).to_socket_addrs())
            .await
            .expect("DNS task panicked")?;
        Ok(addrs.map(|addr| addr.ip()).collect())
    }
}

mullvad-daemon/Cargo.toml line 55 at r2 (raw file):

android_logger = "0.8"
async-trait = "0.1"
hickory-resolver = { version = "0.24.1" }

⛏️ Promote to a workspace dep?

Code quote:

hickory-resolver = { version = "0.24.1" }

Copy link
Member Author

@dlon dlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 27 files reviewed, 9 unresolved discussions (waiting on @hulthe, @MarkusPettersson98, and @Rawa)


mullvad-api/src/https_client_with_sni.rs line 422 at r1 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

This many early returns warrants a doc-comment imo. If someone changes this functions, what invariants do they need to uphold? I.e. how is the sought address resolved.

I know that it coould be considered out of scope for this PR, but it is also a good oppurtunity to apply the boy scout rule 😊

Done. I want to refactor AddressCache into something that implements DnsResolver and falls back on some other DnsResolver if something isn't in its "cache". But that's definitely scope creep.


mullvad-api/src/lib.rs line 326 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

Maybe comment on why we do this thread dance.

Done.


mullvad-api/src/lib.rs line 364 at r1 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

Strictly speaking, do we always need to supply a DnsResolver when creating a new api runtime? Couldn't we use DefaultDnsResolver by default and expose the option to configure the DnsResolver for an existing runtime on-the-fly?

I slightly prefer the explicit way since it forces the caller to select an appropriate resolver.


mullvad-api/src/lib.rs line 325 at r2 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

Document what mechanism that is used for DNS lookups in this implementation (https://doc.rust-lang.org/std/net/trait.ToSocketAddrs.html#tymethod.to_socket_addrs)

Done.


mullvad-daemon/src/android_dns.rs line 1 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

btw why is this defined in Daemon and not mullvad_api where the trait is? Because it depends on talpid-core?

Yes, depending on all of talpid-core seems excessive. I am tempted to move ConnectivityListener out of talpid-core, though.


mullvad-daemon/src/android_dns.rs line 4 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

Minor: When there's a module that exports a single type, I personally prefer putting the docs on the type itself, not the module. Either way though, I think the module docs or the struct docs should link to each other where the real docs are, i.e:

/// See [module docs](self)
pub struct AndroidDnsResolver {}

// or

mod foo {
    //! See [Thingy].

    pub struct Thingy;
}

Nice. I'll adopt that convention.


mullvad-daemon/src/lib.rs line 617 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

Nit: This seems like a fatal error, log::error is more appropriate, no?

Fixed!


mullvad-daemon/src/lib.rs line 623 at r2 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

⛏️ std::result::Result::inspect_err + std::result::Result::map_err

#[cfg(target_os = "android")]
let connectivity_listener = match ConnectivityListener::new(android_context.clone())
    .inspect_err(|err| {
        log::warn!(
            "{}",
             error.display_chain_with_msg("Failed to start connectivity listener")
        );
    })
    .map_err(|_| Error::DaemonUnavailable)?;

👍


mullvad-jni/src/lib.rs line 126 at r2 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

Could we specify what "in flight" means in this context? I suppose tasks refer to tokio tasks, and if so, the tokio runtime will fail to drop if any currently running task never yields at an await point (i.e. if it performs blocking I/O). WDYT?

Added some more information. This is mostly a reminder in case we end up in the same situation.


mullvad-problem-report/src/lib.rs line 303 at r2 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

Code smell? 😉

Replaced with a null resolver.


talpid-core/src/connectivity_listener.rs line 109 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

The lifetime of this pointer should be clearly documented

Done. (This is no longer relevant.)


talpid-core/src/connectivity_listener.rs line 204 at r1 (raw file):

Previously, hulthe (Joakim Hulthe) wrote…

How come we're using call_method_unchecked instead of the safe version?

unchecked is faster since the class isn't looked up each time a method is called: https://docs.rs/jni/latest/jni/struct.JNIEnv.html#checked-and-unchecked-methods.

I switched to the simpler safe method since I doubt it matters here.


talpid-core/src/connectivity_listener.rs line 220 at r1 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

+1

Done. (No longer relevant.)

Copy link
Contributor

@MarkusPettersson98 MarkusPettersson98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r2, 11 of 14 files at r3, 3 of 4 files at r4.
Reviewable status: 24 of 28 files reviewed, 8 unresolved discussions (waiting on @dlon, @hulthe, and @Rawa)


mullvad-api/src/lib.rs line 364 at r1 (raw file):

Previously, dlon (David Lönnhager) wrote…

I slightly prefer the explicit way since it forces the caller to select an appropriate resolver.

Sure :)


talpid-core/src/connectivity_listener.rs line 109 at r1 (raw file):

Previously, dlon (David Lönnhager) wrote…

Done. (This is no longer relevant.)

🙌


mullvad-api/src/https_client_with_sni.rs line 392 at r4 (raw file):

    /// Resolve the provided `uri` to an IP and port using `address_cache` in the first place, and
    /// using `dns_resolver` otherwise.

Isn't this a bit misleading as address_cache might not be used if if let Ok(addr) = hostname.parse::<IpAddr>() { return Ok(..) } is hit?

Code quote:

    /// Resolve the provided `uri` to an IP and port using `address_cache` in the first place, and
    /// using `dns_resolver` otherwise.

talpid-core/src/connectivity_listener.rs line 170 at r4 (raw file):

        // No sender has been registered
        return;
    };

It is probably wise to log if this branch is ever hit. trace or debug should be fine

Code quote:

    let Some(tx) = &*CONNECTIVITY_TX.lock().unwrap() else {
        // No sender has been registered
        return;
    };

mullvad-daemon/src/android_dns.rs line 35 at r4 (raw file):

        let ips = ips.map_err(|err| {
            io::Error::other(format!("Failed to retrieve current servers: {err}"))
        })?;

⛏️ Drop one of the ips bindings

Code quote (i):

        let ips = self.connectivity_listener.current_dns_servers();

        let ips = ips.map_err(|err| {
            io::Error::other(format!("Failed to retrieve current servers: {err}"))
        })?;

Code snippet (ii):

let ips = self
    .connectivity_listener
    .current_dns_servers()
    .map_err(|err| {
        io::Error::other(format!("Failed to retrieve current servers: {err}"))
    })?;

Copy link
Member Author

@dlon dlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 21 of 28 files reviewed, 7 unresolved discussions (waiting on @hulthe, @MarkusPettersson98, and @Rawa)


mullvad-api/src/https_client_with_sni.rs line 392 at r4 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

Isn't this a bit misleading as address_cache might not be used if if let Ok(addr) = hostname.parse::<IpAddr>() { return Ok(..) } is hit?

Done.


talpid-core/src/connectivity_listener.rs line 170 at r4 (raw file):

Previously, MarkusPettersson98 (Markus Pettersson) wrote…

It is probably wise to log if this branch is ever hit. trace or debug should be fine

Done.

Copy link
Contributor

@MarkusPettersson98 MarkusPettersson98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: 25 of 28 files reviewed, 6 unresolved discussions (waiting on @dlon, @hulthe, and @Rawa)


mullvad-api/src/https_client_with_sni.rs line 392 at r4 (raw file):

Previously, dlon (David Lönnhager) wrote…

Done.

But the comment is still missleading, isn't it? It implies that address_cache will be used, but it might not. Consider talking about URI first, and then get to address_cache and dns_resolver 😊

Copy link
Contributor

@MarkusPettersson98 MarkusPettersson98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r7, all commit messages.
Reviewable status: 25 of 28 files reviewed, 5 unresolved discussions (waiting on @hulthe and @Rawa)

@dlon dlon force-pushed the android-tokio-shutdown-timeout branch from 8c260f4 to 4a60f97 Compare November 22, 2024 11:48
Copy link
Contributor

@hulthe hulthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r2, 9 of 14 files at r3, 3 of 4 files at r4, 1 of 4 files at r5, 2 of 3 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: 25 of 28 files reviewed, all discussions resolved (waiting on @Rawa)


mullvad-daemon/src/android_dns.rs line 1 at r1 (raw file):

Previously, dlon (David Lönnhager) wrote…

Yes, depending on all of talpid-core seems excessive. I am tempted to move ConnectivityListener out of talpid-core, though.

Yeah, ´talpid-corefeels like a catch-all library.talpid-net? or maybe it's time for talpid-android`? Food for thought, but this is not a blocker imo.

@dlon dlon force-pushed the android-tokio-shutdown-timeout branch from 4a60f97 to 12e8251 Compare November 22, 2024 12:07
Copy link
Contributor

@Rawa Rawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r2, 9 of 14 files at r3, 2 of 4 files at r4, 4 of 4 files at r5, 2 of 3 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Copy link
Contributor

@Rawa Rawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dlon dlon force-pushed the android-tokio-shutdown-timeout branch from 12e8251 to ad2fc60 Compare November 22, 2024 12:38
@dlon dlon merged commit f629200 into main Nov 22, 2024
62 checks passed
@dlon dlon deleted the android-tokio-shutdown-timeout branch November 22, 2024 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Android Issues related to Android
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants